Environment: Analysis

This section will serve as a first pass at using some methods in the literature to aggregate metrics. I should say at the start that we have a pretty narrow selection of metrics to work with so far, which do not do a great job at capturing the breadth of the dimension. I’m also working with just the county-level data here. This provides some opportunities to use data-driven analyses like PCA, but it is worth noting that these will not get us to the holistic, system-wide measurements of sustainability we are after without including some normative judgments as to how to combine geographic areas as well as our five dimensions. So, let’s just go through the motions here, see how the process unfolds, and note anything worth digging into more down the road.

1 Imputation

PCA requires complete data, so we either have to impute, delete, or use PPCA. I’m choosing to impute with missForest here, as it handles MAR and non-linear data reasonably well, but PPCA is certainly worth exploring.

Code
pacman::p_load(
  dplyr,
  missForest,
  purrr,
  tibble
)
source('dev/filter_fips.R')
env_county <- readRDS('data/temp/env_county.rds')

# Wrangle dataset. Need all numeric vars or factor vars. And can't be tibble
# Also removing character vars - can't use these in PCA
# Using old Connecticut counties - some lulc data is missing for them though
dat <- env_county %>%
  filter_fips('old') %>%
  select(fips, where(is.numeric)) %>%
  column_to_rownames('fips') %>%
  as.data.frame()
# get_str(dat)
# skimr::skim(dat)

# Remove variables with most missing data - too much to impute.
# Also remove the proportional LULC values - keeping diversity though
dat <- dat %>%
  select(-matches('consIncome'), -matches('^lulcProp'))

# Impute missing variables
set.seed(42)
mf_out <- dat %>%
  missForest(
    ntree = 200,
    mtry = 10,
    verbose = FALSE,
    variablewise = FALSE
  )

# Save imputed dataset
imp <- mf_out$ximp

# Print OOB
mf_out$OOBerror
     NRMSE 
0.00142119 
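As flagged above, PPCA is an alternative worth exploring, since it handles missingness within the PCA itself. Below is a minimal sketch on toy data, assuming the Bioconductor pcaMethods package (which this analysis does not otherwise use); it skips quietly if the package is not installed.

Code
```r
# Toy sketch of the PPCA alternative (hypothetical data, not our metrics).
# Assumes Bioconductor's pcaMethods package; this is a no-op without it.
set.seed(42)
toy <- matrix(rnorm(50 * 4), ncol = 4)
toy[sample(length(toy), 10)] <- NA  # knock out a few cells

if (requireNamespace('pcaMethods', quietly = TRUE)) {
  ppca_out <- pcaMethods::pca(toy, method = 'ppca', nPcs = 2, seed = 42)
  toy_imp <- pcaMethods::completeObs(ppca_out)  # NAs replaced with estimates
}
```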

2 Standardization

Centering and scaling to give every variable a mean of 0 and SD of 1.

Code
dat <- map_dfc(imp, ~ scale(.x, center = TRUE, scale = TRUE))
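As a quick sanity check on toy numbers (not our data), scale() hands back values with mean 0 and SD 1:

Code
```r
# Toy example: scale() subtracts the mean and divides by the SD,
# so the result has mean 0 and SD 1
x <- c(2, 4, 6, 8)
z <- as.vector(scale(x, center = TRUE, scale = TRUE))
round(mean(z), 10)  # 0
round(sd(z), 10)    # 1
```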

Now that we have standardized variables, we have to make normative decisions about what constitutes a good or bad value. This will certainly be a collaborative process where we seek input from teams to come to some kind of consensus once we have primary data. But until then, I’m going to make some heroic assumptions that LULC diversity is good, above ground forest biomass is good, conservation practices and easements are good, and fertilizer expenses are bad. Open to thoughts here as always.

With that, we can recode our normalized variables accordingly.

Code
normed <- dat %>%
  mutate(across(c(matches('^fert')), ~ -.x))

3 Component Extraction

Determine the number of components to extract using a few tools: very simple structure (VSS), Velicer’s minimum average partial (MAP) test, parallel analysis, and a scree plot.

Code
pacman::p_load(
  psych
)
VSS(normed)


Very Simple Structure
Call: vss(x = x, n = n, rotate = rotate, diagonal = diagonal, fm = fm, 
    n.obs = n.obs, plot = plot, title = title, use = use, cor = cor)
VSS complexity 1 achieves a maximimum of 0.79  with  1  factors
VSS complexity 2 achieves a maximimum of 0.94  with  2  factors

The Velicer MAP achieves a minimum of 0.06  with  8  factors 
BIC achieves a minimum of  -212.51  with  8  factors
Sample Size adjusted BIC achieves a minimum of  190.51  with  8  factors

Statistics by number of factors 
  vss1 vss2   map dof chisq
1 0.79 0.00 0.157 275  1633
2 0.76 0.94 0.081 251  1083
3 0.73 0.94 0.077 228   897
4 0.60 0.88 0.081 206   722
5 0.55 0.85 0.063 185   597
6 0.53 0.85 0.063 165   547
7 0.52 0.82 0.062 146   429
8 0.52 0.79 0.060 128   326
      prob
1  <1e-18
2  <1e-18
3  <1e-18
4  <1e-18
5  <1e-18
6  <1e-18
7  <1e-18
8  <1e-18
  sqresid  fit RMSEA  BIC SABIC complex eChisq  SRMR eCRMS eBIC
1   34.26 0.79  0.27  476  1342     1.0 1687.8 0.205 0.214  531
2    9.70 0.94  0.22   28   818     1.2  348.0 0.093 0.102 -707
3    5.48 0.97  0.21  -61   657     1.4  170.5 0.065 0.075 -788
4    3.90 0.98  0.19 -145   504     1.7  105.4 0.051 0.062 -761
5    2.57 0.98  0.18 -181   402     1.8   55.9 0.037 0.047 -722
6    1.42 0.99  0.19 -147   373     1.8   24.9 0.025 0.034 -669
7    0.89 0.99  0.17 -185   274     1.8   12.9 0.018 0.026 -601
8    0.59 1.00  0.15 -213   191     1.8    6.6 0.013 0.020 -532
Code
fa.parallel(normed)

Parallel analysis suggests that the number of factors =  3  and the number of components =  3 
Code
pca_out <- pca(normed, nfactors = 3, rotate = 'varimax')
plot(pca_out$values)
abline(h = 1)

This scree plot shows the eigenvalues (unit variance explained) of each principal component (y-axis) against each component (x-axis). The first few components explain lots of variance, but there is a decent elbow around the fourth component.
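The reference line at 1 is the usual eigenvalue cutoff: with standardized variables, the eigenvalues sum to the number of variables, so a component with an eigenvalue above 1 carries more variance than any single original variable did. A toy check (hypothetical data):

Code
```r
# Toy check: for standardized variables, PCA eigenvalues sum to the
# number of variables
set.seed(42)
toy <- scale(matrix(rnorm(200 * 5), ncol = 5))
eig <- prcomp(toy)$sdev^2
round(sum(eig), 10)  # 5
```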

VSS suggests 1 or 2, MAP suggests 8, parallel analysis shows 3. I’m going with 3 here, which will be explained further below.

4 Principal Components Analysis

Now let’s run the PCA.

Code
(pca_out <- pca(normed, nfactors = 3, rotate = 'varimax'))
Principal Components Analysis
Call: principal(r = r, nfactors = nfactors, residuals = residuals, 
    rotate = rotate, n.obs = n.obs, covar = covar, scores = scores, 
    missing = missing, impute = impute, oblique.scores = oblique.scores, 
    method = method, use = use, cor = cor, correct = 0.5, weight = NULL)
Standardized loadings (pattern matrix) based upon correlation matrix
                            RC1   RC2   RC3   h2    u2 com
lulcDiversity              0.03  0.58 -0.13 0.35 0.647 1.1
meanAboveGrndForBiomass   -0.36  0.54  0.01 0.42 0.580 1.7
alleyCropSilvapastureNOps  0.18  0.64  0.53 0.72 0.276 2.1
consEasementAcres          0.06  0.39  0.83 0.84 0.161 1.4
consEasementAcresPF        0.26 -0.08  0.78 0.68 0.317 1.2
consEasementNOps          -0.15  0.79  0.35 0.76 0.235 1.5
consTillExclNoTillAcres    0.94  0.14  0.10 0.92 0.081 1.1
consTillExclNoTillAcresPF  0.92 -0.03  0.12 0.85 0.146 1.0
consTillExclNoTillNOps     0.34  0.82  0.16 0.82 0.181 1.4
consTillNoTillAcres        0.70  0.29  0.36 0.71 0.288 1.9
consTillNoTillAcresPF      0.64 -0.05  0.33 0.52 0.478 1.5
consTillNoTillNOps         0.17  0.89  0.13 0.83 0.168 1.1
coverCropExclCrpAcres      0.91  0.20 -0.04 0.86 0.139 1.1
coverCropExclCrpAcresPF    0.89  0.00  0.13 0.81 0.190 1.0
coverCropExclCrpNOps       0.30  0.87  0.03 0.84 0.158 1.2
drainedDitchesAcres        0.92  0.17  0.04 0.89 0.112 1.1
drainedDitchesAcresPF      0.94  0.06  0.06 0.88 0.116 1.0
drainedDitchesNOps         0.47  0.51  0.15 0.51 0.492 2.2
drainedTileAcres           0.69  0.06  0.37 0.61 0.392 1.5
drainedTileAcresPF         0.68 -0.08  0.40 0.63 0.366 1.6
drainedTileNOps            0.71  0.39  0.32 0.75 0.249 2.0
precisionAgNOps            0.69  0.53 -0.14 0.77 0.226 2.0
rotateIntenseGrazeNOps     0.19  0.71  0.47 0.77 0.228 1.9
fertExpenseTotal          -0.86 -0.31  0.18 0.87 0.131 1.4
fertExpenseOpsWithExp     -0.12 -0.93  0.08 0.88 0.121 1.0

                       RC1  RC2  RC3
SS loadings           9.38 6.39 2.75
Proportion Var        0.38 0.26 0.11
Cumulative Var        0.38 0.63 0.74
Proportion Explained  0.51 0.34 0.15
Cumulative Proportion 0.51 0.85 1.00

Mean item complexity =  1.4
Test of the hypothesis that 3 components are sufficient.

The root mean square of the residuals (RMSR) is  0.07 
 with the empirical chi square  192.26  with prob <  0.96 

Fit based upon off diagonal values = 0.98

Recommendations for creating composite indices are to extract components that each have eigenvalues > 1, explained variance > 0.10, and such that the proportion of explained variance for the total set is > 0.60 (Nicoletti 2000; OECD 2008).

Our total cumulative variance explained is 0.74, and the component that explains the least variance is RC3 at 0.11. Note that extracting four or more components here gives us a component below 0.10, which is why we are sticking to three. The first component (RC1) explains 38% of the variance in the data. The second is respectable at 26%, while the third is barely above the threshold at 11%.

Looking at the metrics, we can see that the first component loads mostly onto the conservation practice acreages: conservation tillage and no-till acres, cover cropping, drainage, and total fertilizer expenses. The second component loads onto mean above-ground biomass (although there is cross-loading with the first component), operations with silvopasture, operations with easements, rotational grazing operations, and operations with fertilizer expenses. This seems to be capturing the operation-count metrics. The last component only loads onto a few metrics: easement acres, easement acres per farm, and silvopasture operations (which has some heavy cross-loading).

5 Aggregation

Here, we follow Nicoletti and calculate the normalized sum of square factor loadings, which represent the proportion of total unit variance of the metrics that is explained by the component.

Code
## Get metric weights following Nicoletti 2000
# Pull out the component score weights (psych's $weights; note these are
# not the same as the $loadings matrix)
loadings <- pca_out$weights %>%
  as.data.frame()

# For each set of loadings, get squares, and then normalized proportions
sq_loadings <- map(loadings, ~ .x^2)
metric_weights <- map(sq_loadings, ~ .x / sum(.x))
head(as.data.frame(metric_weights))
           RC1          RC2          RC3
1 0.0005619064 0.0752164042 0.0258282884
2 0.0345317828 0.0690265012 0.0003647945
3 0.0101384970 0.0285992549 0.0687543999
4 0.0307134305 0.0001543834 0.2569873792
5 0.0022917387 0.0388343663 0.2597289090
6 0.0390099137 0.0853674939 0.0244854925
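As a sanity check on the normalization step, the squared weights for each component should now sum to 1; toy numbers:

Code
```r
# Toy check: squaring a weight vector and dividing by its sum of squares
# yields proportions that sum to 1
w <- c(0.8, -0.4, 0.2)
w_norm <- w^2 / sum(w^2)
round(sum(w_norm), 10)  # 1
```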

Now we can use these to weight metrics and aggregate them into a component score for each county.

Code
# Component scores for each component across each county
component_scores <- map(metric_weights, \(x) {
  as.matrix(normed) %*% x
}) %>%
  as.data.frame()
head(component_scores)
          RC1         RC2         RC3
1 -0.14329669 -0.19176506 -0.40741468
2 -0.14822774  0.07187198 -0.17694498
3 -0.25276927 -0.44271379 -0.29656260
4 -0.34052455 -0.21438065 -0.04465999
5  0.06911446  0.16985119 -0.20467504
6 -0.42230232 -0.58407610 -0.59714488

An alternative method here is regression scores, which are native to PCA. I’ll calculate these as well to compare to the component scores above.

Code
# Get regression scores from pca output
regression_scores <- as.data.frame(pca_out$scores)
head(regression_scores)
          RC1         RC2        RC3
1  0.07619635 -0.47060359 -0.4575708
2 -0.27948884  0.62365888 -0.5172847
3 -0.06844224 -0.82645311  0.2309383
4 -0.41986627 -0.08548724 -0.1010532
5  0.36215173  0.19477898 -0.4475573
6 -0.31849605 -0.90986747 -0.3048266

Running a correlation to see how similar they are:

Code
coefs <- map2_dbl(component_scores, regression_scores, \(x, y) cor(x, y)) %>%
  round(3)
cat(paste0(
  'Pearson Correlation Coefficients:\n',
  'RC1: ', coefs[1], '\n',
  'RC2: ', coefs[2], '\n',
  'RC3: ', coefs[3]
))
Pearson Correlation Coefficients:
RC1: 0.925
RC2: 0.83
RC3: 0.855

It looks like they are reasonably similar, although RC2 and RC3 have substantially lower correlation coefficients. It will be worth noting this and coming back to explore the differences at some point.

For now, let’s keep following Nicoletti and aggregate the component scores into a single variable.

Code
sum_sq_loadings <- map_dbl(sq_loadings, ~ sum(.x))
(factor_weights <- map_dbl(sum_sq_loadings, ~ .x / (sum(sum_sq_loadings))))
      RC1       RC2       RC3 
0.1681632 0.2456354 0.5862013 

Curious that the component that accounted for the most variance is weighted the lowest. A likely reason: we squared psych’s score weights ($weights), which scale roughly inversely with the eigenvalues, rather than the loadings ($loadings), whose squared sums (the SS loadings row above: 9.38, 6.39, 2.75) would weight RC1 highest. Worth a closer look at some point.

In principle these factor weights would scale each component before combining; for now, though, we take a simple unweighted sum of the three component scores.

Code
dimension_scores <- component_scores %>%
  rowwise() %>%
  mutate(
    dimension_score = sum(RC1, RC2, RC3),
    across(everything(), ~ round(.x, 3))
  ) %>%
  bind_cols(rownames(imp)) %>%
  select(fips = 5, everything())
head(dimension_scores)
# A tibble: 6 × 5
# Rowwise: 
  fips     RC1    RC2    RC3 dimension_score
  <chr>  <dbl>  <dbl>  <dbl>           <dbl>
1 23001 -0.143 -0.192 -0.407          -0.742
2 23005 -0.148  0.072 -0.177          -0.253
3 23007 -0.253 -0.443 -0.297          -0.992
4 23009 -0.341 -0.214 -0.045          -0.6  
5 23011  0.069  0.17  -0.205           0.034
6 23013 -0.422 -0.584 -0.597          -1.60 

Now that we have all three component scores and the dimension score, let’s take a look at a map. Select the data to display with the layer button on the left.

Code
pacman::p_load(
  mapview,
  leaflet,
  leafpop
)
map_dat <- readRDS('data/sm_data.rds')[['ne_counties_2021']] %>%
  right_join(dimension_scores) %>%
  left_join(fips_key) %>%
  select(
    fips,
    RC1:RC3,
    'Dimension Score' = dimension_score,
    County = county_name,
    State = state_name,
    geometry
  )

map_dat %>%
  mapview(
    zcol = c(
      'Dimension Score',
      'RC1',
      'RC2',
      'RC3'
    ),
    burst = FALSE,
    hide = c(FALSE, rep(TRUE, 3)),
    popup = popupTable(
      map_dat,
      zcol = names(map_dat)[-length(map_dat)],
      row.numbers = FALSE,
      feature.id = FALSE
    )
  )

Keep in mind there are lots of caveats with this very preliminary analysis, the most egregious being a set of metrics that does not represent the dimension it purports to measure especially well. Missing data and the various branching decision paths in the index scoring also deserve further scrutiny.

Still, there is plenty to look at here as a first pass at aggregating dimension scores. The first component, RC1, was heavily influenced by geography: it loads strongest onto metrics measured in acres or acres per farm. I presume this is why Aroostook County shows up so high on this scale. RC2 loaded strongly onto the number of operations using various conservation practices (easements, no-till, rotational grazing). It seems to track a little with county size, but is highest near relatively urban areas. RC3 was most associated with conservation easement acres and easement acres per farm, and consequently seems to track with rural areas.

I don’t think the dimension score inspires much confidence as it stands. The weighting method for combining components is hard to interpret intuitively, and more expert-driven normative decisions might make more sense there. On the bright side, it is a good expedition into the kinds of ambiguous decisions that will need to be made to aggregate this data across the whole system.


References

Nicoletti, Giuseppe. 2000. “Summary Indicators of Product Market Regulation with an Extension to Employment Protection Legislation.” OECD Economics Department Working Papers 226. Paris: OECD Publishing. https://doi.org/10.1787/215182844604.
OECD. 2008. Handbook on Constructing Composite Indicators: Methodology and User Guide. Paris: Organisation for Economic Co-operation and Development.